Lower PAC bound on Upper Confidence Bound-based Q-learning with examples
Authors
Abstract
Recently, there has been significant progress in understanding reinforcement learning in Markov decision processes (MDPs). We focus on improving Q-learning and analyze its sample complexity. We investigate the performance of tabular Q-learning, approximate Q-learning and UCB-based Q-learning. We also derive a lower PAC bound Ω( |S| |A| 2 ln |A| δ ) of UCB-based Q-learning. Two tasks, CartPole and Pac-Man, are each solved using these three methods. We close with results and discussion. Compared to its alternatives, UCB-based learning does better in exploration but loses its advantage in exploitation.
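The abstract does not spell out how the UCB-based variant chooses actions, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a UCB1-style bonus of the form c * sqrt(ln t / N(s, a)) added to the tabular Q-value; the class name `UCBQLearner` and the parameters `alpha`, `gamma`, and `c` are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


class UCBQLearner:
    """Tabular Q-learning with a UCB1-style exploration bonus.

    Assumption: the bonus c * sqrt(ln t / N(s, a)) is an illustrative
    choice, not necessarily the one analyzed in the paper.
    """

    def __init__(self, n_actions, alpha=0.1, gamma=0.99, c=1.0):
        self.n_actions = n_actions
        self.alpha = alpha    # learning rate
        self.gamma = gamma    # discount factor
        self.c = c            # exploration strength
        self.q = defaultdict(float)             # (state, action) -> value estimate
        self.counts = defaultdict(int)          # (state, action) -> update count
        self.state_visits = defaultdict(int)    # state -> total update count

    def select_action(self, state):
        # Try every action once before trusting the confidence bonus.
        for a in range(self.n_actions):
            if self.counts[(state, a)] == 0:
                return a
        t = self.state_visits[state]

        def score(a):
            bonus = self.c * math.sqrt(math.log(t) / self.counts[(state, a)])
            return self.q[(state, a)] + bonus

        return max(range(self.n_actions), key=score)

    def update(self, state, action, reward, next_state, done):
        # Standard Q-learning update; counts feed the bonus in select_action.
        self.counts[(state, action)] += 1
        self.state_visits[state] += 1
        if done:
            best_next = 0.0
        else:
            best_next = max(self.q[(next_state, a)] for a in range(self.n_actions))
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

In this sketch, rarely tried actions receive a larger bonus, which pushes the agent toward exploration early on; as counts grow the bonus shrinks and the choice approaches the greedy one.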
Similar resources
A New Lower Bound for Completion Time Distribution Function of Stochastic PERT Networks
In this paper, a new method for developing a lower bound on exact completion time distribution function of stochastic PERT networks is provided that is based on simplifying the structure of this type of network. The designed mechanism simplifies network structure by arc duplication so that network distribution function can be calculated only with convolution and multiplication. The selection of...
On the Sample Complexity of Noise-Tolerant Learning
In this paper, we further characterize the complexity of noise-tolerant learning in the PAC model. Specifically, we show a general lower bound of Ω(log(1/δ) / (ε(1−2η)²)) on the number of examples required for PAC learning in the presence of classification noise. Combined with a result of Simon, we effectively show that the sample complexity of PAC learning in the presence of classification noise...
PAC Bounds for Discounted MDPs
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends line...
Near-optimal PAC bounds for discounted MDPs
We study upper and lower bounds on the sample-complexity of learning near-optimal behaviour in finite-state discounted Markov Decision Processes (MDPs). We prove a new bound for a modified version of Upper Confidence Reinforcement Learning (UCRL) with only cubic dependence on the horizon. The bound is unimprovable in all parameters except the size of the state/action space, where it depends lin...